Web Document Modelling and Clustering*

نویسنده

  • William Song
چکیده

Great number of the web documents, created by people of a great variety of walks and used by all the people being able to access to the Internet, gives rise to a problem of how to search the Internet to easily obtain what users want and to filter out what they don't. The problem is strongly related to how to describe or characterize the web documents. On the other hand, labels are being introduced to cluster the web documents. These labels are generated by different people using various categorization models so that conflicts may arise. An introduction of a web document model is undoubtedly necessary for analysis and formulation of web documents, hence for formal clustering of the web documents. In the SISU-PICS project, we propose a web document data model (WDM) to represent various characteristics and components for identifying web documents. We also establish a set of relatedness and similarity relations between WDM objects for computing web document clustering.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses about the future of the World Wide Web development, called Semantic Web. Undoubtedly, Web service is one of the most important services on the Internet, which has had the greatest impact on the generalization of the Internet in human societies. Internet penetration has been an effective factor in growth of the volume of information on the Web. The massive growth of informat...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

Mediated access to very large document collections

Mediation based on structured specialised collections is proposed as an approach to supporting the exploration of very large document collections such as the Web. The main techniques employed are statistical language modelling and document clustering. The paper discusses the new interaction model and presents experimental results obtained with a prototypical system. References to papers which d...

متن کامل

A density based clustering approach to distinguish between web robot and human requests to a web server

Today world's dependence on the Internet and the emerging of Web 2.0 applications is significantly increasing the requirement of web robots crawling the sites to support services and technologies. Regardless of the advantages of robots, they may occupy the bandwidth and reduce the performance of web servers. Despite a variety of researches, there is no accurate method for classifying huge data ...

متن کامل

Finding Community Base on Web Graph Clustering

Search Pointers organize the main part of the application on the Internet. However, because of Information management hardware, high volume of data and word similarities in different fields the most answers to the user s’ questions aren`t correct. So the web graph clustering and cluster placement in corresponding answers helps user to achieve his or her intended results. Community (web communit...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997